the reference genome in a process known as read mapping or alignment. In the
read mapping process, the FASTQ files may contain millions of read sequences that we
wish to align to the sequence of a reference genome to produce aligned reads in a file format
called SAM, which stands for Sequence Alignment/Map format. The aligned reads can also
be stored in the binary form of SAM, called BAM (Binary Alignment/Map format). We will
discuss this file format later in some detail.
In general, sequence mapping or alignment requires three elements: a reference file in
the FASTA format, short-sequence reads in FASTQ files, and an aligner, which is a program
that uses an algorithm to align reads to a reference genome sequence. We have already
discussed how to download the sequence of a reference genome of an organism from the
NCBI Genome database. However, before a reference genome is used with an aligner, it
may need to be indexed with the “samtools faidx” command. You can download and install
Samtools by following the instructions available at “http://www.htslib.org/download/”. On
Ubuntu, you can install it using the following command:
sudo apt-get install samtools
Once you have installed Samtools successfully, you can use it to index the reference
genome and to perform other tasks that you will learn later.
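As a minimal sketch of what indexing looks like, the commands below build a tiny two-sequence FASTA file and index it with “samtools faidx”. The directory toy_ref, the file toy.fna, and the sequence names chrA and chrB are made up for illustration; they are not part of the human reference genome used in this chapter. The sketch assumes Samtools is on your PATH and skips the indexing step otherwise.

```shell
# Build a tiny two-sequence FASTA file (toy data, not a real genome).
mkdir -p toy_ref
printf '>chrA\nACGTACGTAC\nGTACGT\n>chrB\nTTGGCCAA\n' > toy_ref/toy.fna

# Index the file only if samtools is available on the PATH.
if command -v samtools >/dev/null 2>&1; then
    # Writes the index next to the FASTA file as toy_ref/toy.fna.fai.
    samtools faidx toy_ref/toy.fna

    # Each index line holds: sequence name, length, byte offset of the
    # first base, bases per line, and bytes per line.
    cat toy_ref/toy.fna.fai

    # With the index in place, a region (1-based, inclusive) can be
    # extracted without reading the whole file.
    samtools faidx toy_ref/toy.fna chrA:1-4
else
    echo "samtools not found; skipping the indexing step" >&2
fi
```

The same “samtools faidx” call applies to a full reference genome; only the file path changes.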
You have already downloaded the human reference genome above. If you didn’t do that,
you can download and decompress it using the following commands:
mkdir refgenome
wget \
  -O "refgenome/GRCh38.p13_ref.fna.gz" \
  "https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.39_GRCh38.p13/GCF_000001405.39_GRCh38.p13_genomic.fna.gz"
cd refgenome
gunzip -d GRCh38.p13_ref.fna.gz
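After a large download, it can be worth confirming that the archive is intact and that the decompressed file really is FASTA. The sketch below shows one way to do that on a small stand-in file; the directory toy_download and the file toy_ref.fna.gz are hypothetical names used only so that the commands can be run without downloading the full genome.

```shell
# Create a small gzip-compressed FASTA file as a stand-in for the download.
mkdir -p toy_download
printf '>seq1 toy\nACGT\n>seq2 toy\nGGCC\n' | gzip > toy_download/toy_ref.fna.gz

# Test the integrity of the archive without extracting it.
gzip -t toy_download/toy_ref.fna.gz && echo "archive OK"

# Decompress; this replaces the .gz file with toy_download/toy_ref.fna.
gunzip -f toy_download/toy_ref.fna.gz

# A FASTA file starts with a header line beginning with '>'.
head -n 1 toy_download/toy_ref.fna

# Count the sequences by counting the header lines.
n_seqs=$(grep -c '^>' toy_download/toy_ref.fna)
echo "sequences: $n_seqs"
```

For the real genome, the same checks can be pointed at GRCh38.p13_ref.fna.gz and GRCh38.p13_ref.fna in the refgenome directory.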
FIGURE 2.3 Part of the human annotation file in GTF file format.